numerator: joint distribution of all random variables (easily computed for any setting of the hidden variables.)
denominator: marginal probability of the observations (the probability of seeing the observed corpus under the model, summing over all settings of the hidden variables.)
Then use posterior expectations to perform the task at hand: information retrieval, document similarity, exploration, and others.
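As a minimal illustration of the last point, the estimated topic proportions \(\hat{\theta}_d\) can be used directly for document similarity. The sketch below uses scikit-learn's `LatentDirichletAllocation` on a toy stand-in corpus (the corpus and topic count are illustrative assumptions, not from the source):

```python
# Sketch: use estimated topic proportions (theta_d) for document similarity.
# Toy corpus and n_components=2 are placeholder assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "stocks fell as markets reacted to interest rates",
    "the central bank raised interest rates again",
    "the team won the championship game last night",
    "players celebrated the championship victory",
]

X = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)   # each row estimates a document's topic proportions

sim = cosine_similarity(theta)  # document-document similarity in topic space
```

Because similarity is computed in the low-dimensional topic space rather than over raw term counts, documents can be judged similar even when they share few exact words.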
Research Process
Structural Topic Model
We want to use estimates of \(\theta_d\) as the dependent variable in a regression on covariates to test whether different types of documents have different content.
This two-stage approach is internally inconsistent: the topic model assumes all documents are generated by the same statistical process, while the regression assumes content varies with the covariates.
The structural topic model (STM) of Roberts et al. (2016) explicitly introduces covariates into the topic model, allowing one to estimate the impact of document-level covariates on topic content and prevalence as part of the topic model itself.
Topic Prevalence vs. Content
The process for generating individual words is the same as for plain LDA, conditional on the \(\beta_k\) and \(\pi_d\) terms.
However, both objects can depend on potentially different sets of document-level covariates. Each document has:
Topic Prevalence: Attributes that affect the likelihood of discussing topic \(k\)
Topic Content: Attributes that affect the likelihood of including term \(v\) overall, and of including it within topic \(k\)
The generation of the \(\beta_k\) and \(\pi_d\) terms is via multinomial logistic regression, which breaks local conjugacy.
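A minimal generative sketch of the prevalence side: document covariates are mapped to topic proportions through a softmax (multinomial logistic) link, and words are then drawn as in LDA. All dimensions and coefficient values below are illustrative assumptions, and the sketch omits the logistic-normal noise the full STM places around \(X_d \Gamma\):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: D documents, K topics, V vocabulary terms, P covariates.
D, K, V, P = 4, 3, 10, 2
X = rng.normal(size=(D, P))               # document-level covariates
Gamma = rng.normal(size=(P, K))           # prevalence coefficients (estimated in practice)
beta = rng.dirichlet(np.ones(V), size=K)  # topic-word distributions

docs = []
for d in range(D):
    # Softmax link: covariates shift this document's topic prevalence.
    # (The full STM adds logistic-normal noise here; dropped for simplicity.)
    pi_d = softmax(X[d] @ Gamma)
    words = []
    for _ in range(20):                   # 20 tokens per document
        z = rng.choice(K, p=pi_d)         # draw a topic for this token
        words.append(rng.choice(V, p=beta[z]))  # draw a word given the topic
    docs.append(words)
```

Because \(\pi_d\) passes through a softmax rather than a Dirichlet prior, the Dirichlet-multinomial conjugacy of plain LDA no longer holds, which is why STM requires approximate (variational) inference.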
Model Selection
There are three parameters we need to make assumptions about: the number of topics \(K\) and the priors \(\alpha\) and \(\eta\):
Priors don’t receive much attention in the literature: